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Abstract 

Modeling the dependence between outputs is 
a fundamental challenge in multilabel classifi¬ 
cation. In this work we show that a generic 
regularized nonlinearity mapping independent 
predictions to joint predictions is sufficient to 
achieve state-of-the-art performance on a vari¬ 
ety of benchmark problems. Crucially, we com¬ 
pute the joint predictions without ever obtain¬ 
ing any independent predictions, while incorpo¬ 
rating low-rank and smoothness regularization. 
We achieve this by leveraging randomized algo¬ 
rithms for matrix decomposition and kernel ap¬ 
proximation. Furthermore, our techniques are 
applicable to the multiclass setting. We apply our 
method to a variety of multiclass and multilabel 
data sets, obtaining state-of-the-art results. 


1. Introduction 

In multilabel classification the learner is given an example 
and is asked to output a subset of the labels that are rele¬ 
vant for that example. Multilabel problems come up often 
in web applications where data of interest may be tagged 
with zero or more tags representing, for example, the ob¬ 
jects in an image, the topics of a news story, or the relevant 
hashtags of a tweet. A simple approach to multilabel pre¬ 
diction is to learn one classifier per label and predict each 
label independently. Why should we expect that we can im¬ 
prove upon this procedure? One answer is because in many 
real world multilabel problems the labels are correlated. A 
blog post about cooking has a higher chance of also being 
related to nutrition than to finance. 

In this paper we propose a simple and efficient technique 
for multilabel classification, in which the dependencies be¬ 
tween predictions are modeled with a flexible link function. 
This corresponds to the rich literature of using a stacking 
architecture (Wolpert, 1992) in the multilabel context to 
combine independent predictions to produce a joint predic¬ 
tion (Godbole & Sarawagi, 2004). However, many modern 
multilabel (and multiclass) problems are characterized by 


increasingly large output spaces. Such large output spaces 
present both computational and statistical challenges. The 
former is easily understood: a straightforward approach 
will scale at least linearly with the number of outputs both 
during training and inference. The latter also presents sev¬ 
eral challenges: naive models can have too many parame¬ 
ters, while certain labels and features can be extremely rare. 
Predicting each label must therefore borrow strength from 
other labels to avoid overfitting. 

Our approach to mitigating statistical difficulties is via reg¬ 
ularization; specifically we are interesting in enforcing both 
low-rank and smoothness. To mitigate the computational 
burden of this regularization, we leverage randomized algo¬ 
rithms for embedding (Mineiro & Karampatziakis, 2014) 
and kernel approximation (Rahimi & Recht, 2007). For¬ 
tunately, these procedures can be combined in a manner 
which avoids expensive intermediates and scales to the 
largest public benchmark data sets. The technique also 
applies to multiclass classification, which is just a special 
case of multilabel prediction where exactly one of the la¬ 
bels must be output. To illustrate this our experiments in¬ 
clude both multilabel and multiclass data. 

In the rest of the paper we overview a randomized algo¬ 
rithm for embedding (Section 2), we introduce our tech¬ 
nique as fitting a flexible link function (Section 3), and we 
argue that statistical (Section 4) and computational (Sec¬ 
tion 5) improvements are possible if we use randomized 
embedding on the inputs to the link function. We present 
related work on Section 6 and experiments on Section 7. 

2. Background and Notation 

Here we will introduce notation and review randomized 
embedding, a technique we leverage and adapt in the se¬ 
quel. For additional background on randomized linear al¬ 
gebra, we refer the reader to (Halko et al., 2011). 

2.1. Notation 

Vectors are denoted by lowercase letters x, y etc. and ma¬ 
trices by uppercase letters W, Z etc. We are given n ex¬ 
amples as matrices X S and Y S We assume 
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Algorithm 1 Randomized Embedding 
1: Input: X, Y, k,i, q 
2: Q ^ randn(c, k + i) 

3: for i S {1, . . . , q} do 

4: Z ^ argmin^ggdxcfc+o \\YQ - XZ\\], 

5: Q ■(— orthogonalize(y^AZ) 

6: end for 

7: F ^ {Y^XQY {Y^XQ) 

8: (?7,E2)^eig(F,fc) 

9: U^QU 
10: Return U 


that the labels Y are sparse: each example is typically as¬ 
sociated with only a handful of labels. We use for 

the Frobenius norm, E[-] for the expectation operator. 


correlations among the predictions. The simplest way to do 
this is to introduce a link function as an operator that maps 
a vector of activations p to a vector of predictions y = Cp. 
A very popular operator in the setting of multiclass clas¬ 
sification is the softmax: g{p) = V log exp(pi). This 
is an oblivious operator whose effect is to boost the largest 
activation while ensuring the predictions form a probabil¬ 
ity distribution. In multilabel classification, we would like 
to introduce a non-oblivious operator, because we do not 
know a priori which labels frequently occur together. 

A first possibility is a linear operator i.e. C is a c x c matrix. 
A natural choice in this case is 



i=l 


2.2. Randomized Embedding 

Algorithm 1 is a recipe for approximating, via ran¬ 
domized linear algebra techniques, the label projec¬ 
tion matrix associated with the optimal rank-constrained 
least squares prediction of the labels given the features 
(Mineiro & Karampatziakis, 2014). The inputs are the data 
matrix X G the label matrix Y G the desired 

rank k, and hyperparameters £ and q. The algorithm is in¬ 
sensitive to the parameters I and q as long as they are large 
enough (in our experiments we use £ = 20 and q = 1). We 
start with a set of fc -|- £ random vectors and use them to 
probe the range of Y^Hx^lY , where IVx,l is the projec¬ 
tion onto the left singular subspace of the data matrix. We 
compute an orthogonal basis for the range and refine it by 
repeating q times. This can also be thought as orthogonal 
(aka subspace) iteration for finding eigenvectors with early 
stopping (i.e., q is small). Afterwards we use our approxi¬ 
mation of the principal subspace to optimize fully over that 
subspace (lines 7 and 8) and back out the solution (line 9). 
These last few steps are cheap because we are only work¬ 
ing with a(fc-|-£)x(fc-|-£) matrix, and in practice the most 
expensive step is the least squares fit to compute Z (line 4). 
Note the least squares fit is a regression problem in fc -f £ 
outputs, independent of c, whose computation beneficially 
exploits the sparsity of Y. 

Note once the label projection operator has been obtained, 
the feature projection operator can be computed with a final 
least squares fit using the projected labels as targets. 


and with go{p) = Cp we see that the link function is tak¬ 
ing a weighted combination of training labels weighted by 
the inner product between the activation and each label. 
Though intuitive, this link function does not do much: the 
predictions are a fixed linear transformation of the activa¬ 
tions with no adjustable parameters. This can be easily 
fixed by introducing a non-negative scalar ai for each la¬ 
bel: 


9iip) = ^oiiy^yJp, 


( 1 ) 


so that labels with a large ai influence the prediction more 
towards themselves. This link function is now evidently a 
linear kernel machine with a playing the role of dual vari¬ 
ables. Our next step is to generalize the link function by 
replacing the inner product in input space with a kernel 
function, an inner product in a reproducing kernel Hilbert 
space: 

92{p) ='^aiy^K{yi,p) ( 2 ) 


In practice we will approximate the kernel function 
with a finite sum of basis elements. There are sev¬ 
eral techniques for doing this, including the Nystrom 
method (Williams & Seeger, 2001), Incomplete Cholesky 
Factorization (Fine & Scheinberg, 2002), and Random 
Fourier Features (RFFs) (Rahimi & Recht, 2007). Here we 
will use RFFs which are especially easy to work with shift 
invariant kernels (i.e. K{y,p) = K{y — p)) and have the 
advantage that they can be generated in an oblivious way, 
without knowing how the activations p look like. This will 
prove important for the efficiency of the final algorithm. 


3. A Flexible Link Function 

We begin with the premise that in many multilabel prob¬ 
lems certain label combinations frequently co-occur. We 
would like to take advantage of such co-occurrence pat¬ 
terns, but we cannot jump directly into modeling the out¬ 
put space because the classifier can only reach that space 
through the features it uses. Therefore, we will try to model 


Although it is not yet obvious, for our final algorithm 
we believe (and present some empirical justification) that 
shift-invariant kernels are broadly applicable in this con¬ 
text. Recall that applying £2 regularization on a kernel ma¬ 
chine with a shift-invariant kernel, such as the Gaussian 
kernel, is the same as penalizing the derivatives of all or¬ 
ders of the functions that the kernel machine can represent 
(Smola & Scholkopf, 1998). Such a smoothness prior is 
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Algorithm 2 Multilabel via Smooth Link 

1: Input: X, y 

2: Fit independent binary classifiers: pij = fj{xi) 

3: Draw random vectors rt and biases bt t = 1,..., s 
4: Featurize activations: (j)t{xi) = cos{rJ f{xi) + bt) 
5: Learn weight matrix V 

V = argminloss(y, $(A)F) 

ygRoxc 


very appropriate for our link function: we are applying ker¬ 
nels to the activations (or, later, to projected activations), 
not to the feature space. 

Recall that for shift invariant kernels 

K(y,p) (X E(^^i,)^gxU(o, 27 r) cos(r^y -b b) cos(r'^p + b) 

where the distribution q depends on the kernel and U{a, b) 
is the uniform distribution in (a, b). Therefore we can ap¬ 
proximate p 2 by 

n 1 ^ 

53(f) = X] X! cos(rJ Pi + bj) cos{rJp + bj) 

i=i ® i=i 

We now replace ^ X)r=i *^*5* cos(rJ yt -b bj) by optimiza¬ 
tion variables Vj and arrive at 

S 

9{p) = p+bj) 

At this point we have the two-stage procedure outlined in 
Algorithm 2. For multilabel problems, the loss function 
in the final fit can be either least squares or the sum over 
classes of per-class logistic loss. The latter usually requires 
a smaller s to attain the same result, while the former ad¬ 
mits a fast fitting procedure (Vincent, 2014) that is inde¬ 
pendent of the size of the output as long as the output is 
sparse. As we argue in the next sections. Algorithm 2 is ac¬ 
tually quite wasteful both statistically and computationally. 
In the following sections we will address these problems 
using randomized embedding. 

4. Dealing with Statistical Issues 

Though we started with assuming that certain combinations 
of labels (and activations) are more correlated than others 
we have not yet clarified what is the exact property we will 
be exploiting. Correlations between activations are cap¬ 
tured in their empirical second moment and our assumption 
is that this matrix is low rank and therefore can be described 
by /c <C c eigenvectors: 

-i n k 

E[pp^] -^PtpJ = X! 

2=1 2=1 


Empirically we have found this assumption to hold for 
many multilabel problems. Furthermore, similar ideas, 
such as assuming that the second moment of the labels 
is low rank, have been employed in other techniques for 
multilabel problems, e.g., (Tai & Lin, 2012). The key ben¬ 
efit of using activations instead of labels is that our pro¬ 
posed method applies equally well to multiclass problems 
whereas the method of (Tai & Lin, 2012) would yield triv¬ 
ial results in the multiclass setting. 

How can the procedure outlined in Algorithm 2 be im¬ 
proved by taking advantage of the fact that activations are 
low rank? When activations are nominally in but re¬ 
ally only span a smaller subspace of dimension k then the 
kernel approximation employed in lines 3 and 4 of Algo¬ 
rithm 2 is very wasteful. To illustrate this point we will, for 
simplicity, focus on kernels that are not only shift invariant, 
but also rotationally invariant: K{p,p') = K,{\\p — p'\\ 2 ). If 
p,p' only span a space of dimension k then there exists a 
k X c matrix such that ||_p — p ^||2 = \ \U^P — U^p'\\ 2 - 
Moreover the matrix U is given by the top k eigenvectors 
of E\pp^]. We can therefore reduce the dimensionality of 
the activations p with randomized embedding before apply¬ 
ing our kernel approximation. This reduces the variance, 
or, alternatively, requires drawing fewer random vectors to 
achieve the same level of approximation. 

Since each feature function is now computing 
cos{rJ U^p + b) the random vectors that we use to 
project the activations are now Urt with rt G K^. Further¬ 
more, their covariance is equal to UT,U^ with E G 
being the covariance of the sampling distribution for rt 
(typically a multiple the identity). On the other hand, 
Bochner’s theorem tells us that there’s a one to one map¬ 
ping from sampling distributions to positive definite shift 
invariant kernels. Therefore, projecting the activations is 
equivalent to tuning the kernel to the observed data. 

5. Dealing with Computational Issues 

There are three computational issues with the algorithm as 
proposed thus far. First we need to fit individual classifiers 
for each problem which can be very time consuming if the 
number of labels c is large. Second, when c is large forming 
the empirical second moment of the activations and com¬ 
puting the top eigenvectors might be infeasible. Third, the 
final optimization over the matrix V still requires the solu¬ 
tion of a large number of problems. Here we address all of 
them. 

We start with the issue of fitting the s x c matrix V. One 
possibility is to treat each of the columns of V in parallel 
as each of them can be learned independently: by this stage 
we have finished modeling dependencies among labels. For 
square loss we can alternatively use the recent technique of 
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(Vincent, 2014) that shows how to perform stochastic gra¬ 
dient updates for least squares problems when the output 
is large but sparse. This method only requires O(s^) com¬ 
putation instead of 0(sc) where s is the number of basis 
functions (i.e., cosines) we use. 

We tackle the other two issues together. Our key observa¬ 
tion is that in order to run the algorithm we only need to 
have the projections of the activations. Surprisingly, for the 
case of linear (and kernel) classifiers it is possible to ob¬ 
tain these without fitting all individual classifiers, via the 
randomized embedding procedure previously introduced. 

Algorithm 3 puts the above ingredients together. Lines 
2 through 10 are randomized embedding, which approx¬ 
imates the label projection operator. The additional least 
squares fit of Line 11 yields the (approximate) feature pro¬ 
jection operator. Lines 12 through 14 are algorithm 2, but 
applied in the low-dimensional space. In particular, each 
rt e 

A few remarks are in order. For simplicity, we have spec¬ 
ified fc as a parameter but randomized embedding can also 
incorporate a strategy for increasing k if initial estimates 
of the captured variance are too low (Halko et al., 2011). 
To improve generalization performance, we use a regular¬ 
ized least squares fit for the randomized embedding solves 
in lines 5 and 11 of Algorithm 3. Regularization is less 
crucial (and sometimes detrimental) for learning the final 
V matrix, but this is not surprising given the regularization 
inherent in the other aspects of the procedure. 

Another (implicit) parameter of the algorithm is the sam¬ 
pling distribution of the vectors rt- This distribution defines 
the choice of shift invariant kernel we will be using to mea¬ 
sure similarities between activations p and labels y. For¬ 
tunately, we can offer some guidance here using the spec¬ 
tral properties of various shift-invariant kernels (see also 
(Le et al., 2013) for details). We recommend using Gaus¬ 
sian and Cauchy respectively for low and high dimensional 
y. These are special cases of the multivariate Student dis¬ 
tribution with V = oo and v = 1 degrees of freedom. In¬ 
termediate values such as z/ = 3 and z/ = 5 can offer better 
results for medium dimensional y. The corresponding ker¬ 
nels are from the Matern family. Some empirical support 
for their superiority on medium to high dimensional vectors 
is offered in (Le et al., 2013). 

Algorithm 3 describes training but not inference. In prac¬ 
tice we find Algorithm 3 can be used with multiple infer¬ 
ence procedures if we use a proper scoring rule for equa¬ 
tion (3), such as squared loss\ logistic loss (for multiclass), 
or sum over classes of per-class logistic loss (for multil¬ 
abel). Inference can be per-example for example-averaged 
metrics such as Hamming loss or precision-at-fc, whereas 

*This is a poor choice for data sets where some labels are rare. 


Algorithm 3 Multilabel via Smooth and Low-Rank Link 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 


Input: X,Y,k,X 
(f,g)^(20,l) 

Q -(r- randn(c, k + i) 

for f G {1,..., g} do 

Z •<— argmin^gRdxCfc+j!) \\YQ — XZW^p A||Z^|||i 

Q ■(— orthogonalize(V^ArZ) 

end for 

F ^ [Y^XQYiY^XQ) 

(C/,S2)^eig(F,fc) 

U ^QU 

W = argmin^gRdxfc \ \YU - XW\\jp + A||kF|||. 
Draw random vectors rt and biases btt = 1,..., s 
Featurize projected predictions 


= cos(rJ Wxi + ht) 


14: Learn weight matrix 

V = argminloss(y,$(A:)V) -b A||V|||. (3) 


per-class inference is better for macro-averaged Fi score, 
a popular evaluation metric in multilabel tasks. Inference 
procedures are discussed in detail in the experiments. 

6. Related Work 

The algorithms presented here are examples of a stack¬ 
ing architecture (Wolpert, 1992), in which the predictions 
of (simpler) classifiers are used as input to another model 
which produces a (hopefully improved) joint prediction. 
In particular, the idea of combining independent predic¬ 
tions using stacking in the multilabel setting was proposed 
by (Godbole & Sarawagi, 2004), and has been thoroughly 
explored, e.g., (Montanes et al., 2011; Nametal., 2014; 
Read & Hollmen, 2015). Our contributions to this line of 
research are twofold. First, we empirically demonstrate 
that kernel machines, with shift invariant kernels approxi¬ 
mated in the primal, provide an effective and computation¬ 
ally convenient choice for the final classifier. Second, we 
train our pipeline without fitting c independent classifiers 
as a first step. 

Our technique can also be viewed as a calibration pro¬ 
cedure, as the second stage is nothing but a link func¬ 
tion mapping independent predictions to joint predic¬ 
tions. Many calibration procedures (Platt et al., 1999; 
Zadrozny & Elkan, 2001; Kakade et al., 2011) have fo¬ 
cused on binary classification and they are now widely used 
in applications together with diagnostic tools such as cal¬ 
ibration plots. Calibration for multiclass and multilabel 
classification has received little empirical attention. A no¬ 
table exception for multiclass is (Zadrozny & Elkan, 2002) 
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which first produces calibrated probability estimates for in¬ 
duced binary problems and then combines these estimates 
to a final estimate of the posterior probability of each class. 
However, it is not clear how well calibrated the final mul¬ 
ticlass estimates are. In high dimensions, given hardness 
results (Hazan & Kakade, 2012) and a lack of diagnostic 
tools, our approach follows a more pragmatic route; select 
a family of flexible link functions via a kernel machine pa¬ 
rameterization, then learn an efficient approximation to a 
good link function in that family using random features. 

The use of embeddings is pervasive in the multilabel liter¬ 
ature, as it provides both computational and statistical ben¬ 
efits. (Weston et al., 2002) embed both features and labels 
into a common vector space, and use pre-image search for 
inference. Our procedure can be considered a more effi¬ 
cient analog of this where the search is approximated by in¬ 
ference. (Hsu et al., 2009), motivated by advances in com¬ 
pressed sensing, utilized a random embedding of the labels 
along with a sparse decoding strategy. Our embedding is 
not oblivious and takes into account both inputs and out¬ 
puts, and in our experimental section we demonstrate a re¬ 
sulting improvement. (Bengio et al., 2010) combined a tree 
based decomposition with a low-dimensional label embed¬ 
ding, jointly learning both. End-to-end fine-tuning of our 
architecture is conceptually straightforward and plausibly 
useful, but this is outside the scope of this paper. The prin¬ 
cipal label space transform (PLST) (Tai & Lin, 2012) con¬ 
structs a low-dimensional embedding using principal com¬ 
ponents on the empirical label covariance, which is then 
utilized along with a greedy sparse decoding strategy. Be¬ 
cause it uses the empirical label covariance, PLST is not 
applicable to the multiclass setting. 

The conditional principal label space transformation 
(CPLST), another dimensionality reduction approach to 
multilabel classification, has strong connections to our 
technique. In particular (Chen & Lin, 2012) initially 
SVD the same matrix as in algorithm 3, denoted here 
as XZ*, albeit without leveraging randomized tech¬ 
niques. The similarities are intriguing given that CPLST is 
motivated by a bound on Hamming loss. However, CPLST 
solves an optimization problem which is designed to make 
an independent decode strategy effective, and apply kernel- 
ization to the feature space; whereas we learn a decoder and 
apply kernelization to the decoding problem, i.e., predic¬ 
tions. We attribute the broad applicability of shift-invariant 
kernels on the output to the shared statistical structure of 
multilabel problems encountered in practice, as opposed to 
the diverse statistical structure of features, e.g., sparse high- 
cardinality text vs. dense low-cardinality images. In other 
words, the choice of kernel in our procedure is greatly sim¬ 
plified. 

Tree-based approaches are another major category of mul- 


Table 1. Dataset details. 

Dataset I Domain I Examples I Inputs I Outputs 


MULAN.SOURCEPORGE.NET 


BIBTEX 

TEXT 

7395 

1836 

159 

corel5k 

IMAGE 

5000 

499 

374 

MEDIAMILL 

VIDEO 

43K 

120 

101 

YEAST 

BIOLOGY 

2417 

103 

14 


OTHER 


INDUSTRIES 

TEXT 

23K 

47K 

354 

ODP 

TEXT 

1.5M 

0.5M 

lOOK 

LSHTC 

TEXT 

2.4M 

1.6M 

325K 


tilabel learning algorithms. Due to the richness of the liter¬ 
ature, we refer the reader to a survey paper (Cerri et al., 
2014). Here we discuss LastXML (Prabhu & Varma, 

2014) , a multilabel tree ensemble method for which we 
have direct experimental comparisons. LastXML makes 
several design choices to mitigate the computational ex¬ 
pense of applying decision trees to high label cardinality 
(aka extreme) multilabel problems. In particular, LastXML 
partitions the feature space, in contrast to some approaches 
that partition the label space. Lurthermore, LastXML also 
avoids solving an expensive label assignment problem at 
each node by using the union of the labels encountered 
in the training set (ordered by empirical frequency). This 
yields state of the art performance on multiple datasets 
when using a precision-at-fc metric. 

7. Experiments 

Unless otherwise indicated, confidence intervals on test set 
metrics are 90% confidence intervals and are estimated us¬ 
ing the bootstrap (Efron & Tibshirani, 1994). 

7.1. Small Benchmark Datasets 

We begin by demonstrating the effectiveness of the 
technique on several multilabel data sets from mu- 
lan.sourceforge.net (Tsoumakas et al., 2010), utilizing 
their train-test splits. Table 1 lists the details of these 
datasets, which span several application domains. All 
datasets are used as-is without preprocessing. 

We compare our approach with the best reported re¬ 
sults from the multilabel survey paper of (Metz et al., 

2015) , which considered 1543 publications using the mu- 
lan datasets, disqualifying all but 64 for reasons such as du¬ 
plicate papers with different titles, or using preprocessing 
which is not publicly available. We further confirmed that 
the best reported result from (Metz et al., 2015) was at least 
as good as the results reported for PLST and CPLST. The 
metrics we used are example-based Hamming Loss (HL) 
and label-based macro-averaged Fi {F^), defined in the 
standard way (Metz et al., 2015). 















Scalable Multilabel Prediction via Randomized Methods 


Algorithm 4 Fi inference for a single class. Z are the 
predicted label probabilities for the test set, and p is the 
empirical frequency of the label on the training set. 

1: Input: Z,p 

2: Estimate number of positives: p\Z\ 

3: Sort probabilities: ^sorted ■<— sort(Z,'descend') 

4: Estimate denom: r ^ {n+ + 1,..., n+ + |Zsorted|} 

5: Estimate f: / ■<— 2 cumsum(Zsorted)-/r 
6: Maximize estimate: m* = argmax / 

7: Threshold at m*: Y {Z > .^sorted(m*)) 

8: Return Y 


Table 2. Mulan Test Set Metrics 


Metric 

Dataset 

Survey 

This Paper 


BIBTEX 

0.01 

[0.0138, 0.0145] 

HL (4) 

corel5k 

0.01 

[0.0102, 0.0106] 


MEDIAMILL 

0.03 

[0.0310,0.0316] 


YEAST 

0.30 

[ 0 . 212 , 0.226 ] 


BIBTEX 

0.32 

[0.301, 0.333] 

Ff (t) 

corelSk 

0.04 

[ 0 . 0511 , 0 . 0583 ] 


MEDIAMILL 

0.19 

[ 0 . 238 , 0 . 258 ] 


YEAST 

0.87 

[0.428,0.450] 


Training was via Algorithm 3, with sum over classes of 
per-class logistic loss used for equation (3). We used a 
random feature approximation to a Gaussian kernel, in¬ 
troducing two additional hyperparameters (bandwidth and 
number of basis functions). Optimization of equation (3) 
was via preconditioned SGD with momentum, introduc¬ 
ing four additional optimization-related hyperparameters 
(learning rate, learning rate decay, momentum, and num¬ 
ber of data set passes). The embedding dimension k was 
chosen to capture 90% of the empirical label covariance. 
All other hyperparameters were determined by extracting 
a 10% holdout set from the original training set and opti¬ 
mizing via random search (Bergstra & Bengio, 2012); once 
hyperparameters were determined a final assessment was 
done using the original train-test split. Note hyperparame¬ 
ters were tuned separately for each metric. 

Inference is done per-example for Hamming Loss, and 
per-class for . Per-example inference merely thresh¬ 
olds the predicted probability of each class at 1/2. Per- 
class Fi inference is described in Algorithm 4, which 
is a simplified version of the Fi inference procedure in 
(Dembczynski et ak, 2013). Prom Table 2 we conclude our 
approach is typically competitive with the best reported re¬ 
sults and sometimes superior. 

7.2. Is it Just About Flexibility? 

Our approach seems to be working well in practice but a 
reasonable question at this point is where is the better per¬ 
formance stemming from? Since we have a two stage pro¬ 
cedure with non-linear features at the second step, could it 


Table 3. ODP results, k = 300 for both embedding strategies. 


Method 

Alg 3 

CS -H LR 

LT 

Test 

Error(%) 

[83.21, 83.39] 

[85.39, 85.56] 

93.46 


be the case that we could have just obtained the same re¬ 
sults by a much simpler method? Here we illustrate that 
simple methods such as blindly using a kernel (approxima¬ 
tion) directly to the inputs and predicting the labels inde¬ 
pendently produce very different results than our judicious 
use of flexibility to model inter-label dependencies. 

We focus on the RCV1 industries dataset that has a small 
training set, compared to its number of features. Based 
on this, we should expect that naive application of flexi¬ 
ble modeling can lead to decreased generalization perfor¬ 
mance. Indeed, we performed three experiments: learn 
individual logistic regressions to predict each of the 354 
categories, learn individual kernel logistic regressions with 
a Gaussian kernel approximation (the bandwidth was se¬ 
lected as in the previous experiments), and Algorithm 3 
with squared loss for equation (3). The best test Ham¬ 
ming loss of 0.00159 was attained by Algorithm 3. Sec¬ 
ond was the independent logistic regression with a loss of 
0.00163. Einally, learning independent kernel logistic re¬ 
gressions had a loss of 0.00193. Using the bootstrap, we 
obtained confidence intervals at most ±7 • 10“® so none of 
the confidence intervals overlap. 

7.3. ODP 

The Open Directory Project (DMOZ, 2014) is a public 
human-edited directory of the web which was processed by 
(Bennett & Nguyen, 2009) into a multiclass data set. Here 
we compare with (Choromanska & Langford, 2014), uti¬ 
lizing the same train-test split, features, and labels. Specif¬ 
ically there is a fixed train-test split of 2:1 for all experi¬ 
ments, the representation of document is a bag of words, 
and the unique class assignment for each document is the 
most specific category associated with the document. 

Training was via Algorithm 3, with logistic loss used for 
equation (3). We used a linear kernel, i.e., in lieu of lines 
12 and 13 we merely use the embedding directly for equa¬ 
tion (3). Essentially this is rank-constrained logistic regres¬ 
sion, i.e., we are only exploiting the regularization of Algo¬ 
rithm 3 rather than the increased flexibility. Optimization 
of equation (3) was via preconditioned SGD with momen¬ 
tum, introducing four additional optimization-related hy¬ 
perparameters (learning rate, learning rate decay, momen¬ 
tum, and number of data set passes). The embedding di¬ 
mension and other hyperparameters were tuned by hand on 
a 10% validation set extracted from the training set. 

In Table 3 we compare against Lomtree (LT), which has 
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training and test time complexity logarithmic in the num¬ 
ber of classes (Choromanska & Langford, 2014). LT is 
provided by the Vowpal Wabbit (Langford, 2007) machine 
learning tool. We also compare against a compressed sens¬ 
ing analog of Algorithm 3 (CS + LR), where the matrix U 
computed in lines 2 through 10 is replaced by a Gaussian 
random matrix. Both variants of Algorithm 3 outperform 
Lomtree, which is unsurprising given Lomtree’s emphasis 
on computational complexity. Furthermore, there is a lift 
from using a tuned embedding instead of a random one. 

To the best of our knowledge, this is the best published 
result on this dataset. 

On a standard desktop^ it takes approximately 6000 sec¬ 
onds to compute the embedding (lines 2-11 of Algo¬ 
rithm 3), of which 3000 seconds is due to line 11 (which 
is also necessary with compressed sensing). Each data pass 
for optimizing equation (3) takes 1200 seconds, and vali¬ 
dation error starts to turn up after 30 iterations. 

7.4. LSHTC 

The Large Scale Hierarchical Text Classification Challenge 
(version 4) was a public competition involving multilabel 
classification of documents into approximately 350,000 
categories (Kaggle, 2014). The training examples and la¬ 
bels and the test examples are available from the Kaggle 
platform. The features are bag of words representations of 
each document. 

Training was via Algorithm 3, with sum over classes of per- 
class logistic loss used for equation (3). We found Cauchy 
distributed random vectors, corresponding to the Laplacian 
kernel, gave good results. Optimization of equation (3) was 
via preconditioned SGD with momentum, introducing four 
additional optimization-related hyperparameters (learning 
rate, learning rate decay, momentum, and number of data 
set passes). The embedding dimension and other hyperpa¬ 
rameters were tuned by hand, although we were ultimately 
limited by available memory on our standard desktop^, i.e., 
it appears a larger (than 800) embedding dimension and 
more (than 4000) basis functions would further improve re¬ 
sults. 

We used per-example inference, thresholding each classes 
predicted probability at 1/2. This produces poor results 
on macro-averaged FI, which is what the Kaggle oracle 
uses to evaluate submissions. However it does well on 
example-based metrics. We compare with published results 
of (Prabhu & Varma, 2014), who report example-averaged 
precision-at-fc on the label ordering induced for each exam¬ 
ple. To facilitate comparison we do a 75:25 train-test split 
of the public training set, which is the same proportions as 
in their experiments (albeit a different split). 

^dual 3.2Ghz Xeon E5-1650 CPU and 48Gb of RAM. 


Table 4. LSHTC results. 


Method 

Alg 3 

FastXML 

LPSR-NB 

Test 

Precision- AT- 1 (%) 

53.39 

49.78 

27.91 


Table 4 contains the results. LPSR-NB is the Label Parti¬ 
tioning by Sub-linear Ranking algorithm of (Weston et al., 
2013) composed with a Naive Bayes base learner, as re¬ 
ported in (Prabhu & Varma, 2014), where they also intro¬ 
duce and report precision for the multilabel tree learning 
algorithm FastXML. 

8. Conclusions 

In this paper we have proposed a procedure for learning 
a flexible regularized link function for multilabel (as well 
as multiclass) problems. Our procedure empirically works 
better than many strong baselines and scales to large output 
spaces. Thus, similar to (Read & Hollmen, 2015), we con¬ 
clude that simple label dependency models lead to state of 
the art computational and statistical performance. 

In the future we plan to investigate the applicability and ef¬ 
fectiveness of analogous procedures in more complex out¬ 
put spaces where the dependency structure of the output 
variables is specified by a graphical model. 
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